Add truncated_rows parameter to register_csv() and read_csv()#1359
djouallah wants to merge 4 commits into apache:main
Exposes the truncated_rows parameter from Rust DataFusion to Python bindings. This enables reading CSV files with inconsistent column counts by creating a union schema and filling missing columns with nulls.
The parameter was added to DataFusion Rust in PR apache/datafusion#17553 and is now available in datafusion 51.0.0.
Changes:
- Add truncated_rows parameter to SessionContext.register_csv()
- Add truncated_rows parameter to SessionContext.read_csv()
- Add comprehensive tests for both methods
- Update docstrings with parameter documentation
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The tests now only verify that the truncated_rows parameter is accepted by the Python bindings, not the actual behavior. Behavior testing is an upstream DataFusion concern (apache/datafusion#17553). This follows the principle that Python bindings should expose all Rust API parameters regardless of upstream implementation status.
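The behavior referred to here (reading CSV files with inconsistent column counts, building a union schema, and filling missing columns with nulls) can be illustrated in plain Python. This is only a sketch of the described outcome, not DataFusion's algorithm: the function name `union_read` is hypothetical, and unioning columns by header name is an assumption made for the illustration.

```python
from __future__ import annotations

import csv
import io


def union_read(sources: list[str]) -> tuple[list[str], list[list[str | None]]]:
    """Combine CSV texts with differing column counts into one null-padded table.

    Illustration only: columns are unioned by header name (an assumption),
    and any value a row does not provide becomes None.
    """
    schema: list[str] = []
    parsed: list[tuple[list[str], list[list[str]]]] = []
    for text in sources:
        rows = list(csv.reader(io.StringIO(text)))
        header, body = rows[0], rows[1:]
        # Extend the union schema with columns not seen in earlier sources.
        for column in header:
            if column not in schema:
                schema.append(column)
        parsed.append((header, body))

    table: list[list[str | None]] = []
    for header, body in parsed:
        for row in body:
            # zip stops at the shorter sequence, so a truncated row
            # simply has no entry for its trailing columns.
            record = dict(zip(header, row))
            table.append([record.get(column) for column in schema])
    return schema, table
```

For example, combining a two-column source with a three-column source that also contains a short row yields a three-column schema, with `None` wherever a value was not supplied.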
Thank you for the PR. How would you feel about making a more general solution as described in #1358? If we're updating this, we could ensure we have all of the options exposed to our users.
@timsaucer I am not comfortable with this whole thing yet. I know what you want :) and it makes perfect sense, but I don't want to get too excited and do something silly yet :)
Ok, understood. I think it would be better if we added a more general solution instead of just adding one piece at a time, though. Maybe I will take a swing at this later today.
What do you think about going with something more like #1361?
Thanks, that's a better solution 😃
Summary
Exposes the `truncated_rows` parameter from DataFusion Rust to Python bindings for the `register_csv()` and `read_csv()` methods. This parameter enables reading CSV files with inconsistent column counts by creating a union schema and filling missing columns with nulls.
Background
The `truncated_rows` feature was added to DataFusion Rust in apache/datafusion#17553 (merged October 8, 2025) and is available in DataFusion 51.0.0.
Current workaround: Users can already use `truncated_rows` via SQL with external tables.
Problem: The SQL `LOCATION` clause does not support lists of file paths as separate arguments :(
Solution: `register_csv()` and `read_csv()` accept Python lists of paths, making it much more ergonomic.
Changes
- `truncated_rows: bool = False` parameter to `SessionContext.register_csv()`
- `truncated_rows: bool = False` parameter to `SessionContext.read_csv()`
- `src/context.rs`
- `python/datafusion/context.py`
Example Usage
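A minimal sketch of how the parameter might be used with a list of paths, assuming a datafusion-python build that includes this change (other builds will reject the `truncated_rows` keyword). The helper names `write_ragged_csvs`, `read_ragged`, and `register_ragged`, the file names, and the sample data are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_ragged_csvs(directory: Path) -> list:
    """Write two sample CSV files whose column counts differ (hypothetical data)."""
    narrow = directory / "narrow.csv"
    wide = directory / "wide.csv"
    with narrow.open("w", newline="") as f:
        csv.writer(f).writerows([["id", "name"], [1, "a"], [2, "b"]])
    with wide.open("w", newline="") as f:
        csv.writer(f).writerows([["id", "name", "score"], [3, "c", 0.5]])
    return [str(narrow), str(wide)]


def read_ragged(ctx, paths):
    """Read several CSV files in one call, tolerating differing column counts.

    `ctx` is assumed to be a datafusion.SessionContext from a build that
    exposes `truncated_rows` as described in this PR.
    """
    return ctx.read_csv(paths, truncated_rows=True)


def register_ragged(ctx, name, paths):
    """Register several CSV files as one table under the same assumption."""
    ctx.register_csv(name, paths, truncated_rows=True)
```

With a real session, this would be called as `read_ragged(SessionContext(), write_ragged_csvs(Path(tmpdir)))`, where `tmpdir` is any writable directory such as one created with `tempfile.TemporaryDirectory()`. Leaving the keyword out keeps the existing behavior, since it defaults to `False`.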
Testing
Tests verify that the `truncated_rows` parameter is accepted by the Python bindings. The actual behavior of the feature is tested in the upstream DataFusion repository.
This follows the principle that Python bindings should expose all Rust API parameters, and behavior testing is the responsibility of the upstream DataFusion library.
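An accept-only check of this kind could be sketched as a signature inspection. This is not the PR's actual test code, which is not shown here. `StandInContext` is a hypothetical stand-in that mirrors the signatures described above so the sketch runs without the library, and `check_truncated_rows_accepted` is likewise a hypothetical helper name.

```python
import inspect


class StandInContext:
    """Hypothetical stand-in with the method signatures this PR describes."""

    def read_csv(self, path, truncated_rows: bool = False):
        return {"path": path, "truncated_rows": truncated_rows}

    def register_csv(self, name, path, truncated_rows: bool = False):
        return {"name": name, "path": path, "truncated_rows": truncated_rows}


def check_truncated_rows_accepted(context_cls) -> None:
    """Assert both CSV methods declare `truncated_rows` with a default of False."""
    for method_name in ("read_csv", "register_csv"):
        method = getattr(context_cls, method_name)
        params = inspect.signature(method).parameters
        assert "truncated_rows" in params, f"{method_name} lacks truncated_rows"
        assert params["truncated_rows"].default is False


# Passes silently for the stand-in; raises AssertionError otherwise.
check_truncated_rows_accepted(StandInContext)
```

Against the real bindings, the same helper would presumably be given `datafusion.SessionContext` instead of the stand-in. It checks only that the parameter exists and defaults to `False`, and says nothing about how truncated rows are parsed.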
Backward Compatibility
✅ Non-breaking change. The parameter defaults to `False`, maintaining existing behavior.
Related